About the project

Data Science - Let’s get this started!

This is the link to my github repository: https://github.com/LonavanDelden/IODS-project.git

Here is the link to the diary page.


2. Regression and model validation

I start the chapter 2 exercise by setting the working directory and reading in the data set created in the data wrangling exercise. The data set was saved as ‘learning2014_1.csv’.

The data are read in as ‘students2014’, with 7 variables in 7 columns and 166 observations overall. The first column, ‘gender’, is the factor used to categorize the observations in the following statistical analysis.

setwd("C:/Users/lonav/Documents/IODS-project")
students2014 <- read.table("learning2014_1.csv", sep=",", header=T)
str(students2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...

This paired data overview shows the relationship of each variable to every other. First, more data are available from females than males. According to the density curves, the distributions of most variables are similar for female and male respondents. The boxplots show the general distribution of the data and how widely spread they are around the median, highlighting ‘age’ as the variable with the most outliers: the observations are mostly from young people in their 20s and 30s. The scatter plots visualize the interactions between the variables, indicating any correlations present. Another way of looking at correlation is the correlation coefficient ‘Cor’, which summarizes the relationship of two variables in a single, easily comparable value ranging from -1 to 1: values near -1 or 1 indicate the strongest relationships, and a negative sign just means that one variable decreases as the other increases. For this data set we mainly want to know what can influence someone’s ‘points’, so we focus on the correlation coefficients in that column. The strongest relationship is between ‘points’ and ‘attitude’, followed by ‘stra’ and ‘surf’.

library(GGally)
## Loading required package: ggplot2
library(ggplot2)
p <- ggpairs(students2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
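As a reminder of how to read ‘Cor’, here is a minimal sketch with toy vectors (hypothetical values, not the course data):

```r
# Toy vectors: correlation measures how strongly two variables
# move together, on a scale from -1 to 1.
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)   # tends upward with x -> positive correlation
cor(x, y)               # close to +1
cor(x, -y)              # same strength, opposite sign
```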

The variables ‘attitude’, ‘stra’, and ‘surf’ were chosen as explanatory variables based on their correlation coefficients with the dependent variable ‘points’. The model summary gives descriptive statistics of the residuals (minimum, maximum, quartiles and median) as well as the coefficient standard errors and the R-squared. The intercept is the modeled outcome when all explanatory variables equal zero. Each coefficient estimates how much the dependent variable ‘points’ changes when that explanatory variable increases by one unit: a one-unit increase in ‘attitude’ raises the predicted ‘points’ by 3.4, a one-unit increase in ‘stra’ raises it by 0.9, and a one-unit increase in ‘surf’ lowers it by 0.6. The p-value indicates whether this relationship is statistically significant, i.e. how unlikely an effect of this size would be if the true coefficient were zero. According to the model output, of the three variables only ‘attitude’ is highly significant.

library(ggplot2)
my_model <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model)
## 
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## stra          0.8531     0.5416   1.575  0.11716    
## surf         -0.5861     0.8014  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

Based on the outcome of the first model, ‘stra’ and ‘surf’ are now excluded from the model. The estimated coefficient for ‘attitude’ has changed a little and its standard error has decreased. The multiple R-squared of about 0.19 indicates a weak (but highly significant) positive relationship: attitude explains only a small part of the variation in points, with points increasing slightly as attitude increases.

library(ggplot2)
my_model2 <- lm(points ~ attitude, data = students2014)
summary(my_model2)
## 
## Call:
## lm(formula = points ~ attitude, data = students2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.6372     1.8303   6.358 1.95e-09 ***
## attitude      3.5255     0.5674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

The model assumes that the errors are normally distributed and that the relationship between the variables is linear. The following diagnostic plots can be used to check the validity of these assumptions. The ‘Residuals vs Fitted’ plot indicates that the relationship is linear, showing a mostly horizontal line, though this could be improved with, e.g., a larger sample size or additional variables in the model. The ‘Normal Q-Q’ plot also mostly supports the normality assumption, with the points aligning along the line. The ‘Residuals vs Leverage’ plot shows a few outliers, but none that are extremely influential. Overall, the diagnostic plots indicate that the model assumptions are valid.

plot(my_model2, which=c(2,1,5))

Before knitting:

install.packages("GGally"); library(GGally); library(ggplot2)


3. Logistic regression

I start the chapter 3 exercise by setting the working directory, reading in the data set created in the data wrangling exercise, and printing the variable names. The data set was saved as ‘alc.csv’ and includes 382 observations of 35 variables. The observations are from Portuguese students, whose alcohol consumption will now be analysed to identify related factors. The variables are listed below in the column names, with emphasis on the computed alcohol use “alc_use”, which is the average of workday and weekend alcohol consumption, ‘(Dalc + Walc)/2’. If this use is higher than 2, the logical variable “high_use” is TRUE. This high alcohol usage will be the target variable for the exercise.
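The wrangling step described above can be sketched with toy values (a hypothetical data frame, not the real ‘alc’ data):

```r
# Toy data frame: average the two consumption columns and
# flag averages above 2 as high use.
toy <- data.frame(Dalc = c(1, 2, 4), Walc = c(1, 3, 5))
toy$alc_use  <- (toy$Dalc + toy$Walc) / 2
toy$high_use <- toy$alc_use > 2
toy
# alc_use: 1.0 2.5 4.5 -> high_use: FALSE TRUE TRUE
```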

getwd()
## [1] "C:/Users/lonav/Documents/IODS-project"
setwd("C:/Users/lonav/Documents/IODS-project")
library(openxlsx)
alc <- read.csv("alc.csv", sep = ",", header = TRUE)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"
glimpse(alc)
## Observations: 382
## Variables: 35
## $ school     <fct> GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP, GP,...
## $ sex        <fct> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, ...
## $ age        <int> 18, 17, 15, 15, 16, 16, 16, 17, 15, 15, 15, 15, 15,...
## $ address    <fct> U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, U, ...
## $ famsize    <fct> GT3, GT3, LE3, GT3, GT3, LE3, LE3, GT3, LE3, GT3, G...
## $ Pstatus    <fct> A, T, T, T, T, T, T, A, A, T, T, T, T, T, A, T, T, ...
## $ Medu       <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, ...
## $ Fedu       <int> 4, 1, 1, 2, 3, 3, 2, 4, 2, 4, 4, 1, 4, 3, 2, 4, 4, ...
## $ Mjob       <fct> at_home, at_home, at_home, health, other, services,...
## $ Fjob       <fct> teacher, other, other, services, other, other, othe...
## $ reason     <fct> course, course, other, home, home, reputation, home...
## $ nursery    <fct> yes, no, yes, yes, yes, yes, yes, yes, yes, yes, ye...
## $ internet   <fct> no, yes, yes, yes, no, yes, yes, no, yes, yes, yes,...
## $ guardian   <fct> mother, father, mother, mother, father, mother, mot...
## $ traveltime <int> 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 3, 1, 2, 1, 1, 1, ...
## $ studytime  <int> 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 3, 1, 2, 3, 1, 3, ...
## $ failures   <int> 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ schoolsup  <fct> yes, no, yes, no, no, no, no, yes, no, no, no, no, ...
## $ famsup     <fct> no, yes, no, yes, yes, yes, no, yes, yes, yes, yes,...
## $ paid       <fct> no, no, yes, yes, yes, yes, no, no, yes, yes, yes, ...
## $ activities <fct> no, no, no, yes, no, yes, no, no, no, yes, no, yes,...
## $ higher     <fct> yes, yes, yes, yes, yes, yes, yes, yes, yes, yes, y...
## $ romantic   <fct> no, no, no, yes, no, no, no, no, no, no, no, no, no...
## $ famrel     <int> 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 3, 5, 4, 5, 4, 4, 3, ...
## $ freetime   <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, ...
## $ goout      <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, ...
## $ Dalc       <int> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Walc       <int> 1, 1, 3, 1, 2, 2, 1, 1, 1, 1, 2, 1, 3, 2, 1, 2, 2, ...
## $ health     <int> 3, 3, 3, 5, 5, 5, 3, 1, 1, 5, 2, 4, 5, 3, 3, 2, 2, ...
## $ absences   <int> 6, 4, 10, 2, 4, 10, 0, 6, 0, 0, 0, 4, 2, 2, 0, 4, 6...
## $ G1         <int> 5, 5, 7, 15, 6, 15, 12, 6, 16, 14, 10, 10, 14, 10, ...
## $ G2         <int> 6, 5, 8, 14, 10, 15, 12, 5, 18, 15, 8, 12, 14, 10, ...
## $ G3         <int> 6, 6, 10, 15, 10, 15, 11, 6, 19, 15, 9, 12, 14, 11,...
## $ alc_use    <dbl> 1.0, 1.0, 2.5, 1.0, 1.5, 1.5, 1.0, 1.0, 1.0, 1.0, 1...
## $ high_use   <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FAL...

The variables failures, absences, sex and age are chosen as predictors. The hypotheses are that failures, absences and age increase with alcohol consumption, and, following popular consensus, that women show high alcohol usage less often than men. Findings: failures, absences and sex were significant predictors, while age showed no significant influence on high alcohol usage. Therefore, the hypotheses that failures and absences increase with high alcohol usage can be confirmed, but there is no confidence in the impact of age. The boxplot and cross table highlight the lower alcohol usage among women compared to men, which confirms that hypothesis.

Numerical exploration with the logistic regression model:

library(GGally)
library(ggplot2)
m <- glm(high_use ~ failures + absences + sex + age, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + age, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6644  -0.8051  -0.6026   1.0478   2.1115  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.82153    1.72779  -2.212   0.0270 *  
## failures     0.37915    0.15333   2.473   0.0134 *  
## absences     0.07059    0.01795   3.932 8.43e-05 ***
## sexM         0.98875    0.24475   4.040 5.35e-05 ***
## age          0.11375    0.10394   1.094   0.2738    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 417.44  on 377  degrees of freedom
## AIC: 427.44
## 
## Number of Fisher Scoring iterations: 4

Graphical exploration:

library(ggplot2)
g <- ggplot(alc, aes(x = high_use, y = absences, col = sex))
g + geom_boxplot() + ylab("absences") + ggtitle("Student absences by alcohol consumption and sex")

g <- ggplot(alc, aes(x = high_use, y = age, col = sex))
g + geom_boxplot() + ylab("age") + ggtitle("Student age by alcohol consumption and sex")

select(alc, high_use, failures, absences, sex, age) %>% tail(10)
##     high_use failures absences sex age
## 373    FALSE        1        0   M  19
## 374     TRUE        1       14   M  18
## 375    FALSE        0        2   F  18
## 376    FALSE        0        7   F  18
## 377    FALSE        1        0   F  19
## 378    FALSE        0        0   F  18
## 379    FALSE        1        0   F  18
## 380    FALSE        1        0   F  18
## 381     TRUE        0        3   M  17
## 382     TRUE        0        0   M  18
table(high_use = alc$high_use, failures = alc$failures)
##         failures
## high_use   0   1   2   3
##    FALSE 234  22   5   9
##    TRUE   82  16   6   8
table(high_use = alc$high_use, absences = alc$absences)
##         absences
## high_use  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21
##    FALSE 94  2 54  2 36  4 19  6 13  3 12  1  6  0  7  1  2  0  2  0  1  1
##    TRUE  22  1 13  4 15  0 11  2  8  0  5  1  4  3  4  2  4  1  2  1  1  0
##         absences
## high_use 22 23 24 25 26 28 30 54 56 75
##    FALSE  0  1  0  1  1  0  0  0  0  1
##    TRUE   3  0  1  0  0  1  1  1  1  0
table(high_use = alc$high_use, sex = alc$sex)
##         sex
## high_use   F   M
##    FALSE 157 113
##    TRUE   41  71
table(high_use = alc$high_use, age = alc$age)
##         age
## high_use 15 16 17 18 19 20 22
##    FALSE 63 79 65 55  7  1  0
##    TRUE  18 28 35 26  4  0  1

The model is adjusted by dropping the age variable because of its insignificant influence (see the non-significant ANOVA deviance test at the bottom of the next window), which increases the significance level of the other variables. The remaining three variables are positively related to high alcohol usage.

m1 <- glm(high_use ~ failures + absences + sex + age, data = alc, family = "binomial")
summary(m1)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + age, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6644  -0.8051  -0.6026   1.0478   2.1115  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.82153    1.72779  -2.212   0.0270 *  
## failures     0.37915    0.15333   2.473   0.0134 *  
## absences     0.07059    0.01795   3.932 8.43e-05 ***
## sexM         0.98875    0.24475   4.040 5.35e-05 ***
## age          0.11375    0.10394   1.094   0.2738    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 417.44  on 377  degrees of freedom
## AIC: 427.44
## 
## Number of Fisher Scoring iterations: 4
m2 <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
summary(m2)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6629  -0.8545  -0.5894   1.0339   2.0428  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.95397    0.22819  -8.563  < 2e-16 ***
## failures     0.40462    0.15024   2.693  0.00708 ** 
## absences     0.07294    0.01796   4.061 4.88e-05 ***
## sexM         0.98848    0.24453   4.042 5.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 462.21  on 381  degrees of freedom
## Residual deviance: 418.64  on 378  degrees of freedom
## AIC: 426.64
## 
## Number of Fisher Scoring iterations: 4
anova(m1, m2, test="LRT")
## Analysis of Deviance Table
## 
## Model 1: high_use ~ failures + absences + sex + age
## Model 2: high_use ~ failures + absences + sex
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       377     417.44                     
## 2       378     418.64 -1  -1.2029   0.2727

Odds ratios and their confidence interval are explored:

The odds of high alcohol usage are about 2.7 times higher for male students than for female students; each additional class failure multiplies the odds by about 1.5, and each additional absence by about 1.1. This outcome shows that sex is the strongest predictor of high alcohol usage. The confidence interval indicates the range of plausible values around each estimate, i.e. we are 95% confident that the true odds ratio for male students lies between about 1.7 and 4.4.

m <- glm(high_use ~ failures + absences + sex, data = alc, family = "binomial")
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1417107 0.08883883 0.2178283
## failures    1.4987270 1.11549818 2.0187171
## absences    1.0756623 1.04072883 1.1163576
## sexM        2.6871365 1.67434331 4.3755694
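As a sanity check, the odds ratio for sex can be reproduced by hand from the coefficient estimates reported above (numbers copied from summary(m2); this is arithmetic only, not a refit):

```r
# exp(coefficient) is the factor by which the odds multiply per unit change.
b_sexM    <- 0.98848     # sexM estimate from the model summary above
intercept <- -1.95397    # log-odds of high use for the reference group (F)

odds_f <- exp(intercept)           # odds of high use for females
odds_m <- exp(intercept + b_sexM)  # odds of high use for males
odds_m / odds_f                    # equals exp(b_sexM), about 2.69
```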

Predictions are computed and visualized in a 2x2 cross tabulation: the predictions were correct 284 times (FALSE predicted as FALSE, TRUE as TRUE) and incorrect 98 times (86 TRUE observations predicted as FALSE and 12 FALSE predicted as TRUE), i.e. about a quarter of the predictions did not match the observations.

probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, failures, absences, sex, high_use, probability, prediction) %>% tail(10)
##     failures absences sex high_use probability prediction
## 373        1        0   M    FALSE   0.3633449      FALSE
## 374        1       14   M     TRUE   0.6130701       TRUE
## 375        0        2   F    FALSE   0.1408685      FALSE
## 376        0        7   F    FALSE   0.1910175      FALSE
## 377        1        0   F    FALSE   0.1751799      FALSE
## 378        0        0   F    FALSE   0.1241213      FALSE
## 379        1        0   F    FALSE   0.1751799      FALSE
## 380        1        0   F    FALSE   0.1751799      FALSE
## 381        0        3   M     TRUE   0.3215447      FALSE
## 382        0        0   M     TRUE   0.2757800      FALSE
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   258   12
##    TRUE     86   26
library(dplyr)
library(ggplot2)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67539267 0.03141361 0.70680628
##    TRUE  0.22513089 0.06806283 0.29319372
##    Sum   0.90052356 0.09947644 1.00000000
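The counts in the cross tabulation above translate directly into an accuracy and a training error; a minimal sketch using those values:

```r
# Confusion-matrix counts copied from the table above.
tab <- matrix(c(258, 86, 12, 26), nrow = 2,
              dimnames = list(high_use   = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
correct <- sum(diag(tab))   # 258 + 26 = 284
total   <- sum(tab)         # 382
1 - correct / total         # training error, about 0.257
```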

The training error & cross-validation:

The training error, the average proportion of wrong predictions, is about 26%. After 10-fold cross-validation, the model error is essentially unchanged at about 26%. Depending on the scientific field (I am from environmental science) this is quite high and could indicate that variables with high influence are missing. As for my hypotheses, the model clearly proved me wrong about age being an influential variable in this data set. Age might become a significant factor if the observations spanned a whole lifetime (a bigger data set); I would still guess that students in their 40s are less likely to show high alcohol consumption, but there are no observations on this in the data. The other three hypotheses were confirmed, though, so sometimes statistics even support logical thinking :)

loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2565445
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc))
cv$delta[1]
## [1] 0.2591623
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta[1]
## [1] 0.2617801

Before knitting: library(GGally); library(ggplot2)


4. Clustering and classification

Read in the data and explore. The dataset is on Housing Values in Boston Suburbs in the USA. The variables include crime statistics and information on residency types and values, such as access to infrastructure, schooling etc. Please find the summary below.

library(MASS)
data("Boston")
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The graphical overview shows that the variables are not normally distributed (maybe except for “rm” - rooms per dwelling).

The matrix highlights the strongest positive correlation, between the accessibility to radial highways (“rad”) and the full-value property-tax rate (“tax”): property-tax rates are substantially higher close to long-distance connections, which are crucial for many industries.

The strongest negative correlations are between the weighted mean of distances to five Boston employment centres (“dis”) and the age of the buildings, the nitrogen oxide concentration (“nox”), and the proportion of non-retail business acres per town (“indus”). As the distance to the employment centres increases, building age, nox levels and the share of non-retail businesses decrease, which makes sense: new buildings are more common on the outskirts of town (it is easier to build new than to renovate), there is less pollution from traffic, and businesses also prefer the crowded centre. Another strong negative correlation is between the share of lower-status population (“lstat”) and the median value of owner-occupied homes (“medv”), which is a no-brainer.

library(corrplot)
library(tidyverse)
library(GGally)
ggpairs(Boston, upper = list(continuous = wrap("cor", size=2.6)))

cor_matrix <- cor(Boston) %>% round(digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

The data set is standardized by scaling each variable by its mean and standard deviation. Please see the summary below: the mean is now 0 for every variable, and the values are expressed in standard deviations from each variable’s mean rather than in the original units, which makes them overall much smaller.
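What scale() does per column can be checked by hand on a toy vector (hypothetical values):

```r
# Standardization: subtract the column mean, divide by the column sd.
x <- c(2, 4, 6, 8)
manual <- (x - mean(x)) / sd(x)
auto   <- as.vector(scale(x))
all.equal(manual, auto)   # TRUE
mean(manual)              # 0 by construction
```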

boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
n <- nrow(boston_scaled)
ind <- sample(n,  size = n * 0.8)
train <- boston_scaled[ind,]
train_cluster <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
str(train)
## 'data.frame':    404 obs. of  14 variables:
##  $ zn     : num  0.0487 -0.4872 -0.4872 -0.4872 0.3703 ...
##  $ indus  : num  -0.476 1.015 1.015 1.231 -1.138 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.265 0.658 0.512 0.434 -0.965 ...
##  $ rm     : num  -0.1603 -0.0977 -0.9061 0.0489 0.7506 ...
##  $ age    : num  0.978 1.116 0.676 0.978 -1.292 ...
##  $ dis    : num  1.024 -1.247 -0.876 -0.805 0.145 ...
##  $ rad    : num  -0.522 1.66 1.66 -0.522 -0.522 ...
##  $ tax    : num  -0.5769 1.5294 1.5294 -0.0311 -1.1406 ...
##  $ ptratio: num  -1.504 0.806 0.806 -1.735 -1.642 ...
##  $ black  : num  0.441 0.104 -0.713 -0.653 0.441 ...
##  $ lstat  : num  0.91 -0.437 0.203 -0.217 -1.093 ...
##  $ medv   : num  0.497 2.987 -0.188 0.138 1.366 ...
##  $ crime  : Factor w/ 4 levels "low","med_low",..: 2 4 4 3 2 1 4 1 2 4 ...
str(test)
## 'data.frame':    102 obs. of  13 variables:
##  $ zn     : num  -0.4872 -0.4872 -0.4872 0.0487 -0.4872 ...
##  $ indus  : num  -0.593 -0.593 -1.306 -0.476 -0.437 ...
##  $ chas   : num  -0.272 -0.272 -0.272 -0.272 -0.272 ...
##  $ nox    : num  -0.74 -0.74 -0.834 -0.265 -0.144 ...
##  $ rm     : num  0.194 1.281 1.227 0.131 -0.455 ...
##  $ age    : num  0.367 -0.266 -0.511 0.914 0.733 ...
##  $ dis    : num  0.557 0.557 1.077 1.212 0.103 ...
##  $ rad    : num  -0.867 -0.867 -0.752 -0.522 -0.637 ...
##  $ tax    : num  -0.986 -0.986 -1.105 -0.577 -0.601 ...
##  $ ptratio: num  -0.303 -0.303 0.113 -1.504 1.175 ...
##  $ black  : num  0.441 0.396 0.441 0.393 0.393 ...
##  $ lstat  : num  -0.492 -1.208 -1.025 1.092 0.165 ...
##  $ medv   : num  -0.101 1.323 1.486 -0.819 -0.319 ...
lda.fit <- lda(crime ~ ., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2450495 0.2326733 0.2772277 0.2450495 
## 
## Group means:
##                   zn      indus         chas        nox         rm
## low       0.94892733 -0.9206628 -0.192791689 -0.8650720  0.3497630
## med_low  -0.08036353 -0.2982714  0.020859248 -0.5811737 -0.1713901
## med_high -0.37851602  0.1525858  0.184655781  0.3289455  0.1238729
## high     -0.48724019  1.0171737  0.006051757  1.0695003 -0.3655712
##                 age        dis        rad        tax    ptratio      black
## low      -0.8479146  0.9003209 -0.6790935 -0.7515341 -0.3303222  0.3786931
## med_low  -0.3025861  0.3695700 -0.5530287 -0.4585954 -0.1067304  0.3469493
## med_high  0.3797765 -0.3528106 -0.4178919 -0.3226060 -0.2161872  0.1025512
## high      0.8335141 -0.8700173  1.6375616  1.5136504  0.7801170 -0.8030721
##                lstat        medv
## low      -0.72557085  0.42553159
## med_low  -0.10682592 -0.02658537
## med_high -0.02099663  0.16515844
## high      0.92394483 -0.71140951
## 
## Coefficients of linear discriminants:
##                  LD1         LD2         LD3
## zn       0.114150794  0.68537329 -0.93615060
## indus    0.001159763 -0.34104911  0.20820509
## chas    -0.078117647 -0.03082535  0.14872109
## nox      0.381971967 -0.66353085 -1.47796109
## rm      -0.098805633 -0.07769769 -0.14646322
## age      0.241164801 -0.31802645 -0.07431742
## dis     -0.097000826 -0.22407252  0.14829455
## rad      3.125226749  0.89539576 -0.20328895
## tax      0.029452557  0.04200141  0.90837315
## ptratio  0.125194195  0.03694319 -0.46800763
## black   -0.140334907  0.03046298  0.15527895
## lstat    0.254982006 -0.17699385  0.36585616
## medv     0.188130233 -0.37132806 -0.23115094
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9477 0.0381 0.0142
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes <- as.numeric(train$crime)
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1.4)

The LDA model predicts roughly 30% of the test observations into the wrong class when compared to the real data (72 observations are classified correctly, e.g. low predicted as low, and 30 incorrectly, e.g. low predicted as med_low).

lda.pred <- predict(lda.fit, newdata = test)
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       18       9        1    0
##   med_low    9      15        8    0
##   med_high   0       2       11    1
##   high       0       0        0   28
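As a sanity check, the overall accuracy can be computed directly from a confusion table. A minimal sketch (the counts are copied by hand from the output above, not recomputed):

```r
# Confusion table copied from the LDA prediction output above
conf <- matrix(c(18,  9,  1,  0,
                  9, 15,  8,  0,
                  0,  2, 11,  1,
                  0,  0,  0, 28),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
correct_n <- sum(diag(conf))  # observations on the diagonal are classified correctly
total_n   <- sum(conf)
accuracy  <- correct_n / total_n
correct_n          # 72
total_n            # 102
round(accuracy, 2) # 0.71
```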

Next we work on the original “Boston” dataset, which is scaled again. The data are prepared, explored and clustered with k-means. The total WCSS plot suggests two clusters, as the curve drops sharply before levelling off around 2.5; the number of clusters for the visualization was therefore set to two. The pairs matrix plots every variable against every other, so the upper and lower triangles mirror each other.

library(MASS)
data('Boston')
boston_scaled1 <- scale(Boston)
boston_scaled1 <- as.data.frame(boston_scaled1)
dist_eu <- dist(boston_scaled1)
dist_man <- dist(boston_scaled1, method = "manhattan")
summary(dist_eu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1343  3.4625  4.8241  4.9111  6.1863 14.3970
summary(dist_man)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2662  8.4832 12.6090 13.5488 17.7568 48.8618
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled1, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
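The “elbow” in the total WCSS curve can also be located numerically by checking how much each additional cluster reduces the WCSS. A sketch on synthetic data (not the Boston run itself):

```r
set.seed(123)
# Two well-separated synthetic clusters to mimic the elbow behaviour
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
k_max <- 6
twcss <- sapply(1:k_max, function(k) kmeans(x, k, nstart = 10)$tot.withinss)
drops <- -diff(twcss)          # how much each extra cluster reduces WCSS
elbow <- which.max(drops) + 1  # the largest drop happens when moving to this k
elbow                          # 2 for two clearly separated clusters
```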

km <- kmeans(boston_scaled1, centers = 2)
pairs(boston_scaled1, col = km$cluster)

To see the variables more clearly, the plot was cut into smaller pieces of at most 5 variables each. Not all variable pairs show clear clustering, but e.g. the proportion of non-retail businesses (indus), nitrogen oxides (nox), building age (age) and median home value (medv) separate quite well into clusters when compared to other variables. Nitrogen oxides (nox) in particular show clear cluster separation around the mean when paired with age, rm, dis etc. To clarify: the data are scaled, so in this case the clusters separate the data relative to the mean.

library(MASS)
km <- kmeans(boston_scaled1, centers = 2)
pairs(boston_scaled1[1:5], col = km$cluster)

pairs(boston_scaled1[5:10], col = km$cluster)

pairs(boston_scaled1[10:14], col = km$cluster)

K-means is performed with 3 clusters on the original Boston data, standardized by scaling. The proportion of residential land zoned for lots over 25,000 sq. ft. (zn) is the most influential variable.

library(MASS)
library(ggplot2)
data('Boston')
boston_scaled2 <- scale(Boston)
km <- kmeans(boston_scaled2, centers = 3)
cluster <- km$cluster
boston_scaled2 <- data.frame(boston_scaled2, cluster)
lda.fit2 <- lda(cluster ~ ., data = boston_scaled2)
lda.fit2
## Call:
## lda(cluster ~ ., data = boston_scaled2)
## 
## Prior probabilities of groups:
##         1         2         3 
## 0.2470356 0.4268775 0.3260870 
## 
## Group means:
##         crim         zn      indus         chas        nox         rm
## 1 -0.3989700  1.2614609 -0.9791535 -0.020354653 -0.8573235  1.0090468
## 2 -0.3788713 -0.3578148 -0.2879024  0.001080671 -0.3709704 -0.2328004
## 3  0.7982270 -0.4872402  1.1186734  0.014005495  1.1351215 -0.4596725
##           age        dis        rad        tax     ptratio      black
## 1 -0.96130713  0.9497716 -0.5867985 -0.6709807 -0.80239137  0.3552363
## 2 -0.05427143  0.1034286 -0.5857564 -0.5951053  0.01241316  0.2805140
## 3  0.79930921 -0.8549214  1.2113527  1.2873657  0.59162230 -0.6363367
##        lstat        medv
## 1 -0.9571271  1.06668290
## 2 -0.1047617 -0.09820229
## 3  0.8622388 -0.67953738
## 
## Coefficients of linear discriminants:
##                 LD1         LD2
## crim    -0.03206338  0.19094456
## zn       0.02935900  1.07677218
## indus    0.63347352  0.09917524
## chas     0.02460719 -0.10009606
## nox      1.11749317  0.75995105
## rm      -0.18841682  0.57360135
## age     -0.12983139 -0.47226685
## dis      0.04493809  0.34585958
## rad      0.67004295  0.08584353
## tax      1.03992455  0.58075025
## ptratio  0.25864960  0.02605279
## black   -0.01657236 -0.01975686
## lstat    0.17365575  0.41704235
## medv    -0.06819126  0.79098605
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8506 0.1494
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}
classes2 <- as.numeric(boston_scaled2$cluster)
plot(lda.fit2, dimen = 2, col = classes2, pch = classes2, main = "LDA biplot using three clusters 1, 2 and 3")
lda.arrows(lda.fit2, myscale = 1.4)

boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
lda.fit <- lda(crime ~., data = train)
model_predictors <- dplyr::select(train, -crime)
dim(train_cluster)
## [1] 404  14
dim(model_predictors)
## [1] 404  13
dim(lda.fit$scaling)
## [1] 13  3
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling # if %*% is not working, check the dimensions! The second dimension of the first matrix must match the first dimension of the second
matrix_product <- as.data.frame(matrix_product)
library(plotly)
p1 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
p1
p2 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train_cluster$crime)
p2
train2 <- dplyr::select(train, -crime)
km3 <- kmeans(train2, centers = 3)
p3 <- plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km3$cluster)
p3
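The dimension rule mentioned in the comment (the number of columns of the first matrix must equal the number of rows of the second) can be illustrated with a toy example:

```r
A <- matrix(1:6, nrow = 2, ncol = 3)  # 2 x 3
B <- matrix(1:6, nrow = 3, ncol = 2)  # 3 x 2
dim(A %*% B)  # 2 2 : the inner dimensions (3) match, result is 2 x 2
# B %*% A would instead give a 3 x 3 matrix, while A %*% A fails with
# "non-conformable arguments" because 3 != 2
```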

Before knitting: library(GGally); library(ggplot2)


5. Dimensionality reduction techniques

The data are from the United Nations Development Programme and combine various variables to assess the Human Development Index (HDI) with more than just economic growth.

Here are my header explanations:

“f.sec_edu” = Proportion of females with at least secondary education, “m.sec_edu” = Proportion of males with at least secondary education, “f.labour” = Proportion of females in the labour force, “m.labour” = Proportion of males in the labour force

The summary shows that the variables are spread across very different scales, which suggests standardizing them later so they can be compared in further analysis. The pairs graphs show that some of the variables are normally distributed and some are not (e.g. GNI). The strongest (negative) relationship is between the maternal mortality ratio (mat.mor_rat) and life expectancy at birth (life): quite logically, when the mortality ratio increases, life expectancy decreases. Strong positive relationships can be found between the mortality ratio and the birth rate, as well as between life expectancy and expected years of schooling (exp_edu), which is a little less obvious. Overall, education seems to have a stronger influence on life expectancy, birth rate, mortality rate etc. than labour-related factors.
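Standardization, as applied with scale() further below, simply subtracts the column mean and divides by the column standard deviation. A minimal sketch of the equivalence:

```r
x <- c(2, 4, 6, 8, 10)
z_manual <- (x - mean(x)) / sd(x)      # standardize by hand
z_scale  <- as.numeric(scale(x))       # same transformation via scale()
all.equal(z_manual, z_scale)           # TRUE
round(mean(z_manual), 10)              # 0: standardized values are centred
sd(z_manual)                           # 1: and have unit standard deviation
```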

library(openxlsx)
library(tidyr)
library(dplyr)
library(GGally)
library(corrplot)
library(ggfortify)
human <- read.csv("human.csv", sep = ",", header = TRUE)
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ f_m.sec_edu: num  1.007 0.997 0.983 0.989 0.969 ...
##  $ f_m.labour : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ life       : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ exp_edu    : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI        : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ mat.mor_rat: int  4 6 6 5 6 7 9 28 11 8 ...
##  $ birth.rate : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ par.repr   : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
colnames(human)
## [1] "f_m.sec_edu" "f_m.labour"  "life"        "exp_edu"     "GNI"        
## [6] "mat.mor_rat" "birth.rate"  "par.repr"
summary(human)
##   f_m.sec_edu       f_m.labour          life          exp_edu     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI          mat.mor_rat       birth.rate        par.repr    
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(human)

cor(human) %>% corrplot(type = "upper")

The difference between the PCA of standardized and unstandardized data lies in the spread of the plotted data: in the unstandardized case, PC1 becomes practically the sole explanatory component.
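The reason PC1 dominates so completely without standardization is that PCA on raw data is driven by the variables with the largest variance, here GNI. A small illustration on synthetic data (not the human data set itself):

```r
set.seed(42)
# One variable on a huge scale (like GNI), two on small scales
d <- data.frame(big    = rnorm(100, sd = 10000),
                small1 = rnorm(100, sd = 1),
                small2 = rnorm(100, sd = 1))
p_raw <- prcomp(d)                 # unstandardized: 'big' dominates PC1
p_std <- prcomp(d, scale. = TRUE)  # standardized: variance spread out
summary(p_raw)$importance[2, 1]    # ~1: PC1 explains nearly everything
summary(p_std)$importance[2, 1]    # ~0.4: no single dominating component
```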

pca_human <- prcomp(human)
s <- summary(pca_human)
s
## Importance of components:
##                              PC1      PC2   PC3   PC4   PC5   PC6    PC7
## Standard deviation     1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912
## Proportion of Variance 9.999e-01   0.0001  0.00  0.00 0.000 0.000 0.0000
## Cumulative Proportion  9.999e-01   1.0000  1.00  1.00 1.000 1.000 1.0000
##                           PC8
## Standard deviation     0.1591
## Proportion of Variance 0.0000
## Cumulative Proportion  1.0000
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 
## 100   0   0   0   0   0   0   0
pc_lab <- paste0(names(pca_pr),  " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

The PCA on the standardized data distinguishes more individual components, which reduces the PC1 effect to about 54%, still the majority. Now the most influential original variables can be identified: mostly secondary education, along with the birth rate, maternal mortality ratio, life expectancy and gross national income (their arrows are aligned with the PC1 axis). Additionally, the variables group around education, life expectancy and GNI, meaning they are positively correlated with each other while negatively correlated with the birth rate and maternal mortality ratio.

PC1 example: Highly educated + low maternal mortality ratio -> developed countries (Europe, Scandinavia); less educated + high maternal mortality ratio -> developing countries (Africa)

PC2 is the second most influential component but, at 16%, a lot weaker. It represents the situation of women in the work force by combining women in parliament and the ratio of employed women to men (their arrows are aligned with the PC2 axis). These two variables are at close to a 90-degree angle to the other variables, which means they are unlikely to be correlated with them.

PC2 example: High ratio of women to men with work + women in high ranking jobs -> many women in the work force (Africa, Scandinavia); low ratio of women to men with work + less women in high ranking jobs -> less women in the work force (Middle East, Asia)

The PCA on these data highlights that the Human Development Index is mainly determined by the well-known country development indicators such as gross national income, birth rate and secondary education, but also that women in the work force are a valuable component for distinguishing human development: some African countries, for example, do not do well in the general development comparison but distinguish themselves clearly with their rate of women in the work force and in high-ranking political jobs. This might give a more positive picture of equality developments in these countries, as women in parliament are more likely to also work towards improvements in education and mortality rates.

human_std <- scale(human)
summary(human_std)
##   f_m.sec_edu        f_m.labour           life            exp_edu       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI           mat.mor_rat        birth.rate         par.repr      
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
pca_human <- prcomp(human_std)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = "Country Development (53.6%)", ylab = "Women in the work force (16.2%)")

Caption: GNI = Gross National Income per capita, life = Life expectancy at birth, exp_edu = Expected years of schooling, mat.mor_rat = Maternal mortality ratio, birth.rate = Adolescent birth rate, par.repr = Percentage of female representatives in parliament, f_m.sec_edu = Ratio of females to males with at least secondary education, f_m.labour = Ratio of females to males in the labour force


The tea data are quite complex, with 36 variables of qualitative measures. Further analysis is done with a subset of 6 variables. The distribution of the observations per variable is shown in the bar plots below and highlights that the variables differ considerably from each other, with 2 to 4 levels each.

library(FactoMineR)
data("tea")
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300  36
summary(tea)
##          breakfast           tea.time          evening          lunch    
##  breakfast    :144   Not.tea time:131   evening    :103   lunch    : 44  
##  Not.breakfast:156   tea time    :169   Not.evening:197   Not.lunch:256  
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##                                                                          
##         dinner           always          home           work    
##  dinner    : 21   always    :103   home    :291   Not.work:213  
##  Not.dinner:279   Not.always:197   Not.home:  9   work    : 87  
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##                                                                 
##         tearoom           friends          resto          pub     
##  Not.tearoom:242   friends    :196   Not.resto:221   Not.pub:237  
##  tearoom    : 58   Not.friends:104   resto    : 79   pub    : 63  
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##                                                                   
##         Tea         How           sugar                     how     
##  black    : 74   alone:195   No.sugar:155   tea bag           :170  
##  Earl Grey:193   lemon: 33   sugar   :145   tea bag+unpackaged: 94  
##  green    : 33   milk : 63                  unpackaged        : 36  
##                  other:  9                                          
##                                                                     
##                                                                     
##                                                                     
##                   where                 price          age        sex    
##  chain store         :192   p_branded      : 95   Min.   :15.00   F:178  
##  chain store+tea shop: 78   p_cheap        :  7   1st Qu.:23.00   M:122  
##  tea shop            : 30   p_private label: 21   Median :32.00          
##                             p_unknown      : 12   Mean   :37.05          
##                             p_upscale      : 53   3rd Qu.:48.00          
##                             p_variable     :112   Max.   :90.00          
##                                                                          
##            SPC               Sport       age_Q          frequency  
##  employee    :59   Not.sportsman:121   15-24:92   1/day      : 95  
##  middle      :40   sportsman    :179   25-34:69   1 to 2/week: 44  
##  non-worker  :64                       35-44:40   +2/day     :127  
##  other worker:20                       45-59:61   3 to 6/week: 34  
##  senior      :35                       +60  :38                    
##  student     :70                                                   
##  workman     :12                                                   
##              escape.exoticism           spirituality        healthy   
##  escape-exoticism    :142     Not.spirituality:206   healthy    :210  
##  Not.escape-exoticism:158     spirituality    : 94   Not.healthy: 90  
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##          diuretic             friendliness            iron.absorption
##  diuretic    :174   friendliness    :242   iron absorption    : 31   
##  Not.diuretic:126   Not.friendliness: 58   Not.iron absorption:269   
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##                                                                      
##          feminine             sophisticated        slimming  
##  feminine    :129   Not.sophisticated: 85   No.slimming:255  
##  Not.feminine:171   sophisticated    :215   slimming   : 45  
##                                                              
##                                                              
##                                                              
##                                                              
##                                                              
##         exciting          relaxing              effect.on.health
##  exciting   :116   No.relaxing:113   effect on health   : 66    
##  No.exciting:184   relaxing   :187   No.effect on health:234    
##                                                                 
##                                                                 
##                                                                 
##                                                                 
## 
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
tea_time <- select(tea, one_of(keep_columns))
gather(tea_time) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))

While the PCA clearly identified the most important component by grouping the correlated influential variables, this MCA shows a broader distribution of the variance over the dimensions, ranging from roughly 3% to 15% of explained variance (scree plot). This means the first two dimensions do not stand out from the rest as much as in the previous data set.

So the biplot only explains about 30% of the variance in the data, which means more than two thirds is not shown and some insight might be missed. The individuals (observations) are represented as blue dots and the variable categories as red triangles. The distance between any two observations or categories gives a measure of their similarity (or dissimilarity): observations with similar profiles lie close together on the factor map. Despite the low percentage of explained variance, the biplot shows some patterns, but the correlations of the variables with the dimensions indicate weak relationships. Including more variables might improve the outcome, but so far I would say that tea habits are spread quite randomly and relate only weakly to each other.
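How many dimensions would be needed for a reasonable share of the variance can be read off the cumulative percentages; a sketch using the eigenvalue percentages reported in the MCA summary below:

```r
# Percentage of variance per MCA dimension, copied from the summary output
pct <- c(15.238, 14.232, 11.964, 10.333, 9.667, 8.519,
         7.841, 7.705, 6.392, 4.724, 3.385)
cumsum(pct)[2]               # ~29.5%: what the two-dimensional biplot shows
which(cumsum(pct) >= 70)[1]  # 7 dimensions are needed to pass 70%
```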

library(factoextra)
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
fviz_screeplot(mca, addlabels = TRUE, ylim = c(0, 45))

fviz_mca_biplot(mca, ggtheme = theme_minimal())

plot(mca, invisible=c("ind"), habillage = "quali")

# note: the percentages must come from the MCA itself, not from the earlier
# PCA summary object 's' (which still held the human-data percentages);
# FactoMineR stores the MCA eigenvalue percentages in mca$eig
mca_pr <- round(mca$eig[, "percentage of variance"], digits = 1)
mca_pr
mc_lab <- paste0(names(mca_pr),  " (", mca_pr, "%)")
fviz_mca_var(mca, choice = "mca.cor", ggtheme = theme_minimal())


6. Analysis of longitudinal data

This is the summary of all findings on the RATS data analysis:

The main difference between the data sets is that BPRS has two variables (after factoring) while RATS has three. This makes swapping the data sets a little more complicated. Well…

The RATS data group very closely in Groups 1 and 3 but stray a little in Group 2. This is also nicely reflected in the summary graph, where the standard error is noticeably bigger in Group 2. From the summary graph it can already be guessed that Group 1 differs significantly from Groups 2 and 3, because the standard errors, and therefore the approximate 95% confidence intervals (roughly twice the standard error around the mean), do not overlap. Whether Groups 2 and 3 differ significantly, however, needs to be determined with an ANOVA, as they lie quite close together in the summary graph.
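The “twice the standard error” rule of thumb corresponds to an approximate 95% confidence interval for a group mean. A minimal sketch on one synthetic group of weights (not the actual RATS values):

```r
set.seed(1)
w  <- rnorm(16, mean = 250, sd = 10)  # one synthetic group of 16 rats
se <- sd(w) / sqrt(length(w))         # standard error of the mean
ci <- mean(w) + c(-2, 2) * se         # rough 95% interval: mean +/- 2*SE
ci
# if the intervals of two groups do not overlap, their means are very
# likely significantly different; an ANOVA or t-test confirms this
```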

Before the ANOVA, it should be checked whether the data are represented fairly or whether there are outliers that can be eliminated. Checking the RATS data shows that each group has one outlier that can be excluded from the data set. After that, the boxplot already suggests that all three groups will most likely be significantly different from each other, as the spread within each group has decreased substantially. The ANOVA confirms these findings.
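Instead of eyeballing the boxplot, outliers can also be flagged with the same 1.5×IQR rule that the boxplot whiskers use, via `boxplot.stats()`. A minimal sketch on hypothetical group means (the values are illustrative, not the actual RATS ones):

```r
# Flag outliers per group with the 1.5*IQR rule (the rule boxplot whiskers use)
x <- c(263, 262, 267, 271, 276, 275, 268, 239)  # hypothetical Group 1 means
out <- boxplot.stats(x)$out                     # values beyond the whiskers
out
x_clean <- x[!x %in% out]                       # the group without the outlier
```
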

library(dplyr)
library(tidyr)
library(ggplot2)
BPRS <- read.csv("BPRS.csv", sep = ",", header = TRUE)
glimpse(BPRS)
## Observations: 360
## Variables: 5
## $ treatment <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject   <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks     <fct> week0, week0, week0, week0, week0, week0, week0, wee...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
RATS <- read.csv("RATS.csv", sep = ",", header = TRUE)
glimpse(RATS)
## Observations: 176
## Variables: 5
## $ ID     <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Group  <int> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
## $ WD     <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, ...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
## $ time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...
## # A tibble: 6 x 6
##   ID    Group WD    Weight  time stdWeight
##   <fct> <fct> <fct>  <int> <int>     <dbl>
## 1 1     1     WD1      240     1    -1.73 
## 2 2     1     WD1      225     1    -2.83 
## 3 3     1     WD1      245     1    -1.37 
## 4 4     1     WD1      260     1    -0.271
## 5 5     1     WD1      255     1    -0.637
## 6 6     1     WD1      260     1    -0.271

# Number of measurement occasions; used below as the divisor in the SE
n <- RATS$time %>% unique() %>% length()
# Mean weight and standard error per group and time point
RATSS <- RATS %>%
  group_by(Group, time) %>%
  summarise( mean = mean(Weight), se = sd(Weight)/sqrt(n) ) %>%
  ungroup()
glimpse(RATSS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
ggplot(RATSS, aes(x = time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=2) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin = mean - se, ymax = mean + se, linetype="1"), width=0.3) +
  theme(legend.position = "right") +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

RATS8S <- RATS %>%
  filter(time > 1) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()
glimpse(RATS8S)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 263.2, 238.9, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, ...
# Plot boxplot
ggplot(RATS8S, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight) without WD1")

# Filter the outliers
RATS8S1 <- RATS8S %>% 
  filter(
    (mean > 250 & Group == 1) |
    (mean < 550 & Group == 2) |
    (mean > 500 & Group == 3)
    )
# Plot again
glimpse(RATS8S1)
## Observations: 13
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3
## $ ID    <fct> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16
## $ mean  <dbl> 263.2, 261.7, 267.2, 270.9, 276.2, 274.6, 267.5, 443.9, ...
ggplot(RATS8S1, aes(x = Group, y = mean)) +
  geom_boxplot() +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight) without WD1")

# Add the baseline (day 1 weight); note this assumes a wide-format RATS data
# frame with a WD1 column, so the long-format RATS read in above will not work
RATS8S2 <- RATS8S %>%
  mutate(baseline = RATS$WD1)
#fit <- lm(mean ~ baseline + Group, data = RATS8S2)
#anova(fit)
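The commented-out fit above is a baseline-adjusted ANOVA: does Group still matter after controlling for the starting weight? A self-contained sketch of the same idea on toy data (all names and numbers here are illustrative, not the RATS values):

```r
# Toy baseline-adjusted ANOVA: is there a Group effect on the follow-up
# mean after adjusting for the baseline (first) measurement?
set.seed(1)
toy <- data.frame(
  Group    = factor(rep(1:3, each = 5)),
  baseline = c(rnorm(5, 250, 5), rnorm(5, 450, 5), rnorm(5, 520, 5))
)
toy$mean <- toy$baseline + rnorm(15, 10, 3)  # follow-up tracks the baseline

fit <- lm(mean ~ baseline + Group, data = toy)
anova(fit)  # sequential sums of squares: baseline first, then Group
```
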

This is the summary of all findings on the BPRS data analysis:

Using a linear mixed effects model on a data set with only two explanatory variables is a little overkill, but let's follow the exercise. Looking into different random-effects structures, the effect of week always remains highly significant (indicated by the *** below). However, it is noticeable that the standard errors, t-values and degrees of freedom can change substantially when changing the random effects.

library(lme4)
library(lmerTest)
library(afex)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRS, REML = FALSE)
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + (week * treatment), data = BPRS, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (1 | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  45.8817     2.4413  92.0743  18.794   <2e-16 ***
## week         -2.2704     0.2084 340.0000 -10.896   <2e-16 ***
## treatment     0.5722     1.0761 340.0000   0.532    0.595    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) week  
## week      -0.341       
## treatment -0.661  0.000
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)  45.8817     2.5685  49.2111  17.863  < 2e-16 ***
## week         -2.2704     0.2977  19.9985  -7.626 2.42e-07 ***
## treatment     0.5722     1.0405 320.0007   0.550    0.583    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##           (Intr) week  
## week      -0.477       
## treatment -0.608  0.000
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood . t-tests use
##   Satterthwaite's method [lmerModLmerTest]
## Formula: bprs ~ week + treatment + (week | subject) + (week * treatment)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0513 -0.6270 -0.0768  0.5287  3.9259 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0076  8.0627        
##           week         0.9686  0.9842   -0.51
##  Residual             96.4708  9.8220        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                Estimate Std. Error       df t value Pr(>|t|)    
## (Intercept)     50.1767     3.5159 141.9981  14.271  < 2e-16 ***
## week            -3.3442     0.6711 253.0304  -4.983 1.16e-06 ***
## treatment       -2.2911     1.9090 319.9966  -1.200   0.2310    
## week:treatment   0.7158     0.4010 319.9966   1.785   0.0752 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) week   trtmnt
## week        -0.768              
## treatment   -0.814  0.753       
## week:trtmnt  0.684 -0.896 -0.840
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRS
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + (week * treatment)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4                           
## BPRS_ref2  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
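The chi-square statistic in these model comparisons is simply twice the difference in log-likelihoods, i.e. the difference in deviances. A quick check by hand, using the deviances and parameter counts printed in the first comparison above (small rounding differences are expected, since the printed deviances are rounded):

```r
# Likelihood-ratio test by hand, using the deviances printed above
chisq <- 2738.7 - 2731.4          # deviance difference between nested models
df    <- 7 - 5                    # difference in the number of parameters
p     <- pchisq(chisq, df, lower.tail = FALSE)
round(c(chisq = chisq, df = df, p = p), 4)
```
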
# Plot of the observed bprs values over time, one line per subject
#ggplot(BPRS, aes(x = week, y = bprs, group = subject)) +
  #geom_line(aes(linetype = factor(subject))) +
  #scale_x_continuous(name = "Week", breaks = seq(0, 8, 2)) +
  #scale_y_continuous(name = "bprs") +
  #theme(legend.position = "top")
Fitted1 <- fitted(BPRS_ref1)
BPRS <- BPRS %>%
  mutate(Fitted1)
# Plot of the fitted bprs values over time, one line per subject
#ggplot(BPRS, aes(x = week, y = Fitted1, group = subject)) +
  #geom_line(aes(linetype = factor(subject))) +
  #scale_x_continuous(name = "Week") +
  #scale_y_continuous(name = "Fitted(bprs)") +
  #theme(legend.position = "top")